Prioritized Experience Replay

1 Overview

Prioritized experience replay was proposed by T. Schaul et al. and is widely used to speed up reinforcement learning.

Roughly speaking, mis-predicted observations are sampled and learned from more frequently. To compensate for the distorted sampling probability, the learning weight is scaled in the opposite direction (cf. importance sampling).

In cpprb, the PrioritizedReplayBuffer class implements these functionalities with proportional-based (instead of rank-based) priorities, as shown in the table below.

| \(i\)-th | Definition |
|---|---|
| Probability: \( P(i) \) | \( \frac{(P_{i}+\epsilon)^{\alpha}}{\sum _{j=0}^{N} (P_{j}+\epsilon)^{\alpha}} \) |
| Weight: \( w_i \) | \( \frac{\left( \frac{1}{N} \frac{1}{P(i)}\right) ^{\beta}}{\left( \frac{1}{N} \frac{1}{\min _{j} P(j)}\right) ^{\beta}} = \left(\frac{\min _{j} P(j)}{P(i)}\right) ^{\beta} \) |
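As a minimal illustration of these formulas (a NumPy sketch, not cpprb's internal implementation; the function names here are hypothetical), the probabilities and normalized weights can be computed as:

```python
import numpy as np

def priorities_to_probabilities(priorities, alpha=0.6, eps=1e-4):
    """Proportional priorities -> sampling probabilities P(i)."""
    scaled = (priorities + eps) ** alpha
    return scaled / scaled.sum()

def importance_weights(probabilities, beta=0.4):
    """Normalized importance-sampling weights, always <= 1."""
    n = len(probabilities)
    w = (1.0 / (n * probabilities)) ** beta
    # Dividing by the largest weight is the same as dividing by the
    # weight of the smallest P(j), so the 1/N factors cancel.
    return w / w.max()

priorities = np.array([1.0, 2.0, 0.5, 4.0])
p = priorities_to_probabilities(priorities, alpha=0.5)
w = importance_weights(p, beta=0.5)
```

Note that with alpha=0 every priority maps to the same probability, i.e. uniform sampling.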

You can add priorities together with the other environment values. If no priorities are passed, the stored maximum priority is used.

The dict returned by sample also has the special keys indexes and weights. The indexes are intended to be passed to update_priorities to update their priorities after comparison with new predictions. The weights can be used to scale errors when updating network parameters. Since the weights are normalized by the largest weight, as in the original paper, the values are always less than or equal to 1.
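For example, in a DQN-style update the weights would typically scale the squared TD errors before averaging (a hedged sketch with hypothetical, illustrative values; td_errors stands in for whatever errors your network produces):

```python
import numpy as np

# Hypothetical per-sample TD errors from the current network.
td_errors = np.array([0.3, -1.2, 0.05, 0.8])
# Weights as returned by sample(); always <= 1.
weights = np.array([1.0, 0.4, 0.9, 0.6])

# Importance-sampling-corrected loss: scale each squared error by its weight.
loss = np.mean(weights * td_errors ** 2)

# New priorities are commonly the absolute TD errors,
# passed back to the buffer via update_priorities.
new_priorities = np.abs(td_errors)
```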

PrioritizedReplayBuffer has hyperparameters alpha (\( \alpha \)) and eps (\( \epsilon \)) in its constructor and beta (\( \beta \)) in its sample method. Their default values are 0.6, 1e-4, and 0.4, respectively. The details are described in the original paper cited above.

| Parameter | Default | Description |
|---|---|---|
| alpha (\( \alpha \)) | \( 0.6 \) | Prioritization exponent. 0 means uniform sampling. |
| beta (\( \beta \)) | \( 0.4 \) | Compensation exponent. 1 means full compensation (i.e. importance sampling). |
| eps (\( \epsilon \)) | \( 10^{-4} \) | Small value to avoid 0 priority. |

2 Example Usage

import numpy as np
from cpprb import PrioritizedReplayBuffer

buffer_size = 256

prb = PrioritizedReplayBuffer(buffer_size,
                              {"obs": {"shape": (4,4)},
                               "act": {"shape": 3},
                               "rew": {},
                               "next_obs": {"shape": (4,4)},
                               "done": {}},
                              alpha=0.5)

for i in range(1000):
    prb.add(obs=np.zeros((4,4)),
            act=np.ones(3),
            rew=0.5,
            next_obs=np.zeros((4,4)),
            done=0)

batch_size = 32
s = prb.sample(batch_size, beta=0.5)

indexes = s["indexes"]
weights = s["weights"]

# train
# ...

new_priorities = np.ones_like(weights)
prb.update_priorities(indexes, new_priorities)

3 Notes

The constructor of PrioritizedReplayBuffer (i.e. __init__) has a parameter check_for_update. If it is True (the default is False), the buffer tracks the indices updated since the last call to sample(), and update_priorities() updates priorities only at unchanged indices. This feature is designed for multiprocess learning, to avoid mis-updating the priorities of values that have already been overwritten. (This protection is always enabled in MPPrioritizedReplayBuffer.)

4 Technical Detail

To choose prioritized samples efficiently, partial sums and minimums of pre-calculated weights are stored in a Segment Tree data structure, which is written in C++ and was the initial main motivation of this project.

To support multiprocessing, the Segment Tree can also be updated lazily (see MPPrioritizedReplayBuffer).
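To illustrate the idea (a minimal pure-Python sketch, not cpprb's C++ implementation; capacity is assumed to be a power of two), a sum tree stores partial sums so that sampling a transition proportionally to its priority is a single O(log N) descent from the root:

```python
import numpy as np

class SumTree:
    """Minimal binary sum tree over a fixed-size priority array."""

    def __init__(self, capacity):
        # capacity is assumed to be a power of two; leaves live in tree[capacity:].
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity)

    def update(self, index, priority):
        """Set a leaf priority and refresh the partial sums up to the root."""
        i = index + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, value):
        """Find the leaf whose cumulative-sum interval contains `value`.

        `value` must lie in [0, total priority); descend left when the
        left subtree's sum covers it, otherwise subtract and go right.
        """
        i = 1
        while i < self.capacity:
            left = 2 * i
            if value < self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return i - self.capacity
```

Drawing a uniform value in [0, total priority) and descending with sample then picks index i with probability proportional to its stored priority.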

Minibatch sampling is done by stratified sampling, which draws one sample from each of several equally sized probability ranges, resulting in smaller variance among minibatches.
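The stratification step can be sketched as follows (a NumPy sketch under the assumption that the total priority mass is split into batch_size equal strata; the function name is hypothetical):

```python
import numpy as np

def stratified_samples(total, batch_size, rng=None):
    """Draw one value from each of `batch_size` equal sub-intervals of [0, total)."""
    rng = np.random.default_rng() if rng is None else rng
    stratum = total / batch_size
    # One uniform draw inside each equal-probability stratum; each value
    # would then be fed to the sum-tree lookup to pick a transition.
    return (np.arange(batch_size) + rng.random(batch_size)) * stratum
```

Because every stratum contributes exactly one sample, no region of the priority distribution can dominate a single minibatch.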